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Abstract 

r^. ' We study modifications of the Viterbi Training (VT) algorithm to estimate emission parameters 

in Hidden Markov Models (HMM) in general, and in mixure models in particular. Motivated by 

\& ' applications of VT to HMM that are used in speech recognition, natural language modeling, image 

o : 

^f , analysis, and bioinformatics, we investigate a possibility of alleviating the inconsistency of VT while 

controlling the amount of extra computations. Specifically, we propose to enable VT to asymptotically 
fix the true values of the parameters as does the EM algorithm. This relies on infinite Viterbi 
alignment and an associated with it limiting probability distribution. This paper, however, focuses 
on mixture models, an important case of HMM, wherein the limiting distribution can always be 
computed exactly; finding such limiting distribution for general HMM presents a more challenging 
task under our ongoing investigation. 

A simulation of a univariate Gaussian mixture shows that our central algorithm (VA1) can dra- 
matically improve accuracy without much cost in computation time. 

We also present VA2, a more mathematically advanced correction to VT, verify by simulation its 
fast convergence and high accuracy; its computational feasibility remains to be investigated in future 
work. 

Keywords: Viterbi Training algorithm; hidden Markov models; mixture models; EM algorithm; consis- 
tency; parameter estimation 
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1 Introduction 



Motivated by applications of the Viterbi Training (VT) algorithm to estimate parameters of Hidden 
Markov Models in speech recognition (Huang et al. 1990|. Ney et al. 1994, Rabiner and Juang 1993, Ra 



bincr ct al. 1986| , Stcinbiss et al. 1995| , Strom et al. 1999 ), natural language modeling ( Och and Ney 2000| ), 
image analysis ( Li ct al. 2000| ), bioinformatics ( Ehret et al. [2001 , Ohlcr et al. 2001 ), and by connections 
with constrained vector quantization ( Chou ct al. 1989] , |Gray et al. 2000 ), we are interested in improving 
accuracy of the VT estimators while preserving its essential computational advantages. 

Let 9i be the emission parameters of HMM with states I € S = {1, ...,K}. The standard method 
to compute a maximum likelihood estimator of is the EM algorithm that in the HMM context is also 
known as the Baum-Welch or forward-backward algorithm (Baum and Pctric 1966, Bilmee 1998, Huang 



ct al. 1990| , Jclinek 2001, Rabiner 1989, Rabiner and Juang 1992, Young 2003). Since EM can in practice 
be computationally expensive, it is commonly replaced by Viterbi Training. VT effectively replaces the 
computationally costly expectation (E) step of EM by an appropriate maximization step that is compu- 
tationally less intensive. An important example of successful and elaborate application of VT in industry 



is Philips speech recognition systems (Ney et al. 1994). 

There are also variations of VT that use more than one best alignment, or several perturbations of 



the best alignment (Och and Ney 2000). The improvements that we explore are, however, of a different 



nature: roughly, we increase estimation accuracy purely by means of analytic calculations and do not 
require computing more than one optimal alignment. 

VT is inferior to EM in terms of accuracy because the VT estimators need not be (local) maximum 
likelihood estimators (VT does not necessarily increase the likelihood); which leads to bias and inconsis- 
tency (§|). 

Given current parameter values, VT first finds a Viterbi alignment that is a sequence of hidden states 
maximizing the likelihood of the observed data. Observations assumed to have been emitted from state 
I, are regarded as an i.i.d. sample from p, the corresponding emission distribution. These observations 
produce P™, the empirical version of p, and ultimately jii, a maximum likelihood estimate of #/. /} is 
then used to find an alignment in the next step, and so forth. It can be shown that in general this 
procedure converges in finitely many steps; also, it is usually much faster than EM. 

In speech recognition, the same training procedure was already described by L. Rabiner et al. in 
(Juang and Rabiner 1990] , Rabiner et al. 1986) (see also (Rabiner 1989, Rabiner and Juang 1993)) who 
considered his procedure as a variation of the Lloyd algorithm from vector quantization, and referred to it 
as segmential K-means training. The analogy with vector quantization is especially pronounced when the 
underlying chain is a sequence of i.i.d. variables in which case the observations are simply an i.i.d. sample 
from a mixture distribution (§|3|). For such mixture models, Viterbi training was also described by R. 



Gray et al. in (Chou ct al. 1989), where the training algorithm was considered in the vector quantization 



context under the name of entropy constrained vector quantization (ECVQ). (See also (Gray et al. 2000) 
for more recent developments in this theory.) 

Our goal is to alleviate the inconsistency of the VT estimators while preserving the fast convergence 
and computational feasibility of the baseline VT algorithm. To this end, we note that 9* , the true pa- 
rameters, are asymptotically a fixed point of EM but not of VT §0, §0. We thus attempt to adjust VT 
in order to restore this property by studying the asymptotics of P". Thus, we discuss the existence of 
Qi leS 

PP^Qi, 1&S a.s., (1) 

first in the general HMM context - §|[ and then in the special case of mixture models - §|3| If such 

limiting measures exist, then under certain continuity assumptions, the estimators \±i will converge to /z;, 

where 

Hi = argmax / \nfi(0 h x)Qi(dx). 
i J 

Taking into account the difference between \i\ and the true parameter, the appropriate adjustment of the 
Viterbi training can now be defined (§0). 

However, the asymptotic behavior of P™ is not in general straightforward and its analysis requires an 



extension of the definition of Viterbi alignment at infinitum, (Lember and Koloydenkc 2004). With the 
infinite alignment one can prove the existence of the limiting measures Qi (m) , which is essential for the 
general definition of the adjusted Viterbi training. 

To implement these ideas in practice, a closed form of Qi (or fa) as a function of the true parameters 
is necessary. The measures Qi also depend on the transition as well as on the emission models, and 
computing Qi can be very difficult. However, in the special case of mixture models §|3|, the measures Qi 
are easier to find. We are also motivated by the continuing interest of others in computational efficiency 



and accuracy of parameter estimation in mixture models ( Dias and Wedcl 2004 , Lin et al. 2004 ) . In §g, we 



describe the adjusted Viterbi training (VA1) for the mixture case, which we view as the main contribution 
of this paper: VA1 recovers the asymptotic fixed point property and, since its adjustment function does 
not depend on data, each iteration of VA1 enjoys the same order of computational complexity (in terms of 
the sample size) as VT. Moreover, for commonly used mixtures, such as, for example (Example pTj), mix- 
tures of multivariate normal distributions with unknown means and known covariances, the adjustment 
function is available in a closed form requiring integration with the mixture densities. Depending on the 
dimension of the emission variates and on the number of components, and on the available computational 
resources, one can vary the accuracy of the adjustment. We reiterate that, unlike the computations of 
the EM algorithm, computations of the adjustment do not involve evaluation and subsequent summation 
of the mixture density at every data point. 



We first introduce these ideas for the case of known mixture weights and then extend them in §4.1 to 
the case of unknown weights. 

To test our theory, in §0 we simulate a mixture of two univariate normal distributions with unit 
variance, unknown means, and unequal but comparable weights. The main goal of our simulations is 
to compare the performances of VT, VAl, and EM in terms of the accuracy, convergence, amount of 
computations per iteration, and the total amount of computations. The simulations are performed under 



different conditions on the initial guess when the weights are assumed to be known, §5.1, and unknown, 



§ |5.2| , and the results (§ p.3| ) are consistently in favor of VAl. 

In §|4|, we propose VA2, a more mathematically advanced correction to VT; we verify its fast conver- 
gence and high accuracy on the simulated data in §|5|, and intend to elaborate on its computationally 
feasible implementations in future work. A concluding summary is given in §|6|. 



2 General HMM 

Let Y be a Markov chain with finite state space S. We assume Y to be irreducible and aperiodic with 
transition matrix P = (pij) and initial distribution 7r that is also the stationary distribution of Y. To 
every state I G S there corresponds an emission distribution Pi on (X,B), a separable metric space and 
the corresponding Borel cr-algebra. Let //, the density of p with respect to some reference measure A 
(for instance, the Lebesgue measure), be known up to the parametrization fi(x; 0{). When Y is in state I, 
an observation according to Pi(0*) and independent of everything else is emitted, with 9* — (#*, . . . , 9* K ) 
being the unknown true parameters. 

Thus, for any y = j/i, j/2, • • •, a realization of Y , there corresponds a sequence of independent random 
variables, X\, X%, . . ., where X n has distribution P Vn . Note that we only observe X = X\, X%, . . . and 
the realization y is unknown (Y is hidden). 

The distribution of X is completely determined by the chain parameters (P, n) and the emission 
distributions p, I G S. The process X is also mixing and, therefore, ergodic. We now recall Viterbi 
Alignment and Training. 

Let Xi, . . . ,x n be first n observations on X . Let A(qi, . . . , q n ; Xi, • ■ ■ , x n ] 9) be the likelihood function 
P(Yi =q tl i = l,...,n) J]" =1 f qi (Xi] 9 qi ), q t € S. 

The Viterbi alignment is any sequence of states qi, ■ ■ ■ ,q n £ S that maximizes the likelihood of 
observing xi, . . . ,x n , 9 being fixed. Thus, for a fixed 9, the Viterbi alignment is a maximum- likelihood 
estimator of the realization o/Yi, . . . , Y n given x\, . . . ,X n . In the following, the Viterbi alignment will be 
referred to as the alignment. For each n > 1, let V be the set of alignments: 

V(»i, . . . ,Xn] 0) = {v G S n : V™ G S n A(w; x u . . . , x n ; 9) > A(u>; x u ■ ■ ■ , x n ; 9)}. (2) 



Any map v : X n t— ► V(xi, . . . , i„; 9) will also be called an alignment. Further, unless explicitly specified, 
vg will denote an arbitrary element of V(xi, . . . , x n ; 9). 

Viterbi Training 

1.) Choose an initial value 0° = (0?, . . . , 6^). 

2.) Given 9^ j > 0, find the alignment 

Vgj(xi, ...,X n ) = {Vl, ■ ■ ■ ,V n ) 

and partition Xi, . . . , x n into K subsamples, with Xk going to the I th subsample if and only if Vk = I- 
Equivalently, define K empirical measures 

^;^):=^7^, AeB, leS, (3) 

where I a stands for the indicator function of set A. 

3.) For every subsample find the MLE given by: 

W(0*) - arg max /m/K^Pf^;^'), (4) 



and take 6*^ = ili(9 J ), I e 5. If for some I € S, Vi ^ I for any i = 1, . . . ,n (/ subsample is 



empty) , then the empirical measure P" is formally undefined, in which case we take 9\ — 9\ . We 



omit this exceptional case in the following discussion. 

VT can be interpreted as follows. Suppose that at step j, 9^ — 9* and hence vgj is obtained using the true 
parameters. The training is then based on the assumption that the alignment v(xi, . . . , x n ) = (vi, ■ ■ ■ , v n ) 
is correct, i.e., Vi = Y^, i = 1, . . . ,n. In this case, the empirical measures PJ l {9 :) ) 1 I € S would be obtained 
from the i.i.d. sample generated from P;(0*), and the MLE p,i(9*) would be the natural estimator to 
use. Clearly, under these assumptions P[ l (9*) => Pi{9*) a.s. and, provided that {fi(-',9) : 9 £ 0/} 
is a P/-Glivenko-Cantelli class and 0/ is equipped with some suitable metric, rirn n _» 00 /*/(#*) = 9* a.s. 
Hence, if n is sufficiently large, then P™ sw P and 9j +1 = fi t {9*) w 6»,* = 0f, V^ i.e. 6"' = 9* would be 
(approximately) a fixed point of the training algorithm. 

A weak point of the previous argument is that the alignment in general is not correct even when 
the parameters used to find it are, i.e. generally Vi ^ Yi. In particular, this implies that the empirical 
measures P[ L {9*) are not obtained from an i.i.d. sample taken from Pi(9*). Hence, we have no reason to 
believe that P( l (9*) => Pi(9*) a.s. and lim n ._ >oo /xj(0*) = 9"[ a.s. Moreover, we do not even know whether 
the sequences of empirical measures {P[ l (9*} and MLE estimators {fii{0*)} converge (a.s.) at all. 



In (Lcmber and Koloydenkc 2004), we prove the existence of limiting probability measures Qi(6,9*), 



I £ S, that depend on 9, the parameters used to find the alignment vg(x\, . . . , x n ), and on 9* , the true 



parameters with which the random samples are generated. Namely, these Qi, I € S will be such that for 
every I 

Pr(8*)^Qi(6*,0*), a.s.. (5) 

Suppose also that the parameter space 0/ is equipped with some metric. Then, under certain consistency 
assumptions on classes T\ = {fi(&i) ■ ®i € &i}, the convergence 

lim £i(0*)=W(0*,0*) a - s - ( 6 ) 

n — »oo 

can be deduced from (|5|), where 

Mi (#,r) d = argmax / ']nfi{x;9[)Qi{dx;9,9*). (7) 

We also show that in general, for the baseline Viterbi training Qi(9* , 9*) ^ Pi (8*), implying ni(9*, 9*) ^ 
9f . In an attempt to reduce the bias 9f — fii(9*,9*), we next propose the adjusted Viterbi training. 
Suppose (|5|) and (g) hold. Based on (0), we now consider the mapping 

w(«) = w(M)> 1 = 1,..., if. (8) 

Since this function is independent of the sample, we can define the following correction for the bias: 

A,(0) = 0i- W (0), i = l Jf. (9) 

VA1 - Adjusted Viterbi Training 

1.) Choose an initial value 6° = (0?, . . . , 6° K ). 

2.) Given J ', perform the alignment and define K empirical measures Pp(9^) as in (|3|). 

3.) For every P™, find fii{6^) as in (||) and for each I, define 9j + = )2i(9 j ) + A;(0 J ), where A; is defined 
VI E S in (|). 

Note that, as desired, for n sufficiently large, the adjusted training algorithm has 9* as its (approximately) 
fixed point: Indeed, suppose 9 j = 9*. From (g), #j(0J) = (n{9*) w /^(0*) = w(0 J ), for all Z e S. Hence, 

#/ +1 = fn(9*) + A,(0*) « W (0*, 0*) + A,(0* ) = 0,* = J ', I € S. (10) 

3 Mixture model 

In general, no closed form for the distribution Qi(9*,0*) in (0) is available. Therefore, the mapping (ra) 
may be impossible to determine exactly and approximations of Qi should be used for the adjustments of 
Viterbi training (§|2j). However, in the case of the mixture model, the distributions Qi are straightforward 



to find and the adjusted Viterbi training can therefore be fully specified. In this model, Y, the underlying 
Markov chain, is a sequence of i.i.d. discrete random variables with the state space S = {1, . . . ,K} of 
mixture components. Thus, the transition probabilities are pij = Pj, i,j € S, where pj are mixture weights. 
To each component I G S there corresponds a probability distribution p with density fi = /;(•;#*), where 
9* are the true parameters. Unless explicitly stated otherwise, the mixture weights pi will be assumed 
to be known. Such a model produces observations x±, . . . , x n , that are regarded as an i.i.d. sample from 
the mixture distribution P with density 

K K 

For any set of parameters 9 = (#i, . . .9k), the alignment vg can be obtained via a Voronoi partition 
S(9)={S 1 (9),...,S K (9)}, where 

Si(0) = {x : Pl h(x;ei) > PjfjixiBj), Vj G S} (12) 

5i(») = {a;:p,/i(x;ft)>Pj/i(»;^). Vj € 5}\(5i U . . . U 5,_ x ), l = 2,...,K. (13) 

Now, the alignment can be defined as follows: t>e(a>) = / if and only if a; € Si(9). In particular, given the 
Voronoi partition S(9) = {Si, . . . , Si}, the empirical measures P" (||) are 

^9) = &^# , ^6, leS. (14) 

L l =i / s,(e)(^) 

Thus, given the same partition, fci(6) M), the sub-sample MLE for component I, becomes 

p,i(6) = argmax / hi fi(x; 9' l )P n (dx), (15) 

where P n is the ordinary empirical measure associated with the given random sample. The convergence 
(a) then follows immediately from (JT3) . Indeed, for any 9, by virtue of the Strong Law of Large Numbers 
we have 

pn aJ . P(AnSi(0);9*) SsMnAffoFW*) HiPiJsMnAfiWiW*) 

™ ^ ' j PiSm-,0*) f Si(e) f(x;9*)d\(x) E^I Sl{e) M^9*)dX(x) ' 

Since <¥ is separable, it follows that P™ =>■ Q; a.s., where 

i 

are the densities of respective Qi(9, (9*)'s. 

Now it is clear that even when the partition S(9*) is obtained using the true parameters 9* , Qi(9* , 9*), 
the limiting distribution (density qi(x;9* ,9*)), can be different from p((9*), the desired distribution 
(density fi(x;9*)). Likewise, m{6*) (|j) can be different from 

9* = arg max / In/,(s; ^)/,(s; ^)dA(x). 



In order to see this, note that (uh and (B) in the context of the mixture model specialize to 

m(8,0*) = axgmax [ ln/jfoflO/fo^W*) (16) 

Hi{9) = argmax / In/lfe^Eft/i^Si))^), (17) 

respectively. We also emphasize that A can be significant which justifies the adjustment. 
Example 3.1 Let 



i=i 

where cf>(x; 9*) is the density of the d-variate normal distribution with identity covariance matrix and vector 

of means 8* £ M. d = 0/ for I = 1, 2, . . . , K . In this case, for each K-tuple of parameters 9 = {9\, . . . , 9k)> 

the decision-rule for the alignment is essentially as follows (disregarding possible ties): vg{x) = i if and 

only if \\x — 6i\\ < mim,- ||x — 9j\\. Thus, the decision regions in this case correspond to the Voronoi 

partition in its original sense, justifying our generalization of this term. Now, it can be easily seen that 

for all m — 1, . . . , d: 

Si-i jq.ffli x m4>(x; 9i)dxi ■ ■ ■ dx d 

W)) m = ^K SlS) M - w T— • (18) 

Li=i J Sl{9) 0{x;9i)dxi---dxd 

When d and K are large, the exact integration in ( Jlq ) still requires intensive computations, for which 

reason one may be interested in approximations of (Eq). In the context of the above example, one might 

think of the following approximations for A;(0) = 6i — fii(8)- 



1.) Approximate ( £, fi(x; 9i))I Sl (6) in @ by fi(9i,x)I Sl ( S ), so 

J^rfl-, x m <t>{.x; 9i)dx\ ■ ■ ■ dxd 

^ m W ,/ fl J rf— • ( 19 ) 

JSt(9) 9{x]9i)dxi ■ ■ -dx d 

This approximation is based on the limiting case when the components are "infinitely" far from each 
other. 

2.) If K > d, then some components are fully surrounded by others, namely, the partition cells corre- 
sponding to such "internal" components are bounded (Figure ID). It is then conceivable that A/'s that 
correspond to the bounded cells are less significant than the others, in which case one might only 
correct the estimators of the internal components. 

3.) Every Voronoi cell is determined by several hyperplanes and every such hyperplane H° gives rise to 
A; , a component of A; in the direction perpendicular to H^ and corresponding to the I th term in the 
sum in (p"8|). Thus, A; = J\- Aj (see, for example, I = 1 cell in Figure El). It may be reasonable to 
find only the "main direction" of correction, i.e. the largest Aj for each I (see, for example, the / = 2 
cell in Figure [l]) . 




Fig. 1. An example of the Voronoi partition for multiple components. True parameters 9* and correspond- 
ing /j,(9*) are marked with solid and transparent dots, respectively. For I = 1 component, Aj(9*), the 
correction components, are indicated. For I = 2 component, the main direction of the correction is indi- 
cated. It appears a reasonable approximation to neglect the corrections for the estimators corresponding 
to the bounded Voronoi regions. 



Remark 3.2 In Example 3.1, the decision regions correspond to the Voronoi partition in its original 
sense. Moreover, it is easy to see that in this particular case, the Viterbi training is none other than the 
well-known (generalized) Lloyd algorithm designed for finding vector quantizers, which in this case are 



also called Jf-means (see, e.g. (Sabine and Grai 198b\j). In this case, the estimators obtained by the 



Viterbi training are empirical K-means. The latter estimators enjoy certain desirable properties, and in 



particular they are consistent with respect to the population K-means (Pollarc 1981). However, they need 
not be consistent with respect to 9* , our parameters of interest. In the mixture case, the Viterbi training 
can always be considered as the (generalized) Lloyd algorithm, and the estimators obtained by Viterbi 



training can be regarded as (generalized) empirical K-means (Chou et al. 1989\ ). This observation links 



the study of Viterbi Training and related algorithms to the theory of vector quantization. 

4 VA2 — Adjustment of the second type 

The adjusted Viterbi training is designed to asymptotically fix the true parameter 9* , returning approx- 
imately the correct solution given this solution as the initial guess and given an infinitely large data 
sample: VA1(0*) rs 9* . VA2 goes further and attempts to maximally expand {9 : VA1(0) s=s 9*}, the set 
of parameter values that are asymptotically mapped to the true ones, to {9 : VA2(9) « 9*}. Specifically, 
if the algorithm ever arrives at S(9*), the Voronoi partition corresponding to the true parameters 9*, 
then we would like to coerce the adjusted estimates to return 9*. Let us explain these ideas in more 
detail. 

Let S* stand for S(9*), the true Voronoi partition (that also coincides with the Bayes decision bound- 
ary). The mapping 9 h-» S(9) is generally many-to-one, hence the set &(S*) = {9 : S(8) = S*} generally 
contains more than one element. (This also means that guessing S* , i.e. guessing any element from 
<d(S*), is generally easier than guessing 9* .) We now introduce VA2: 

Note first that fj,i(0, 9*) in (fL6[), as well as the estimate fii(9) in (fL5|), depends on 9 through S(9) only. 
However, the correction A;(0) = 9\ — ni(6, 9) does depend on 9 fully and hence would not generally work 
(in the sense of (J10|) ) for an arbitrary 6° € 0(<S*) unless #■?' — 9* . We now attempt to improve the first 
type of adjustment that is based on adding A(0 J ) to fi{9^). Namely, we propose the following iterative 
update for I — 1, . . . , K: Next, define Hi,e(s(e°))(9) ( a s function of 9 only) to be the restriction of m(8°, 9) 
to 0(S(9°)), and write Hi go(9) in place of the more cumbersome Mz,e(S(e°))($)- Let 

• +1 _ Jm^CM^)), if a unique ^(AK^)) exists 
\fj-i{9 j ) + Ai(9 j ) otherwise. 

For any Qi and 9*, the event that /i(0 J ', 9*) belongs to the range of ^(#? ,9) as a function of 9 G Q(S(9 j )) 



is of zero probability, as Example LI illustrates. Hence, the introduction of the individual inverses \i l gj 



10 



I = 1, . . . , K is essential, although still not always effective: In some mixture models (a mixture of normal 
distributions with unequal weights is one such example), for a fixed I, the event that fii(9 J ) belongs in 
the range of ni{Qi ,ff) (as function of 9 € 0(«S(0 J ))) need not occur with probability one for all #■?' and 
9*. This, and also the fact that the inverses in general need not have a closed form, or may require 
intensive computations, may reduce the attractiveness of the suggested method. Further discussion of 
the computational issues related to this method is outside the scope of this paper, except for mentioning 
the possibility of various, e.g. linear or quadratic, approximations of the above functions /xj~g . 

In order to better understand the meaning of the new adjustment, imagine that 9 J E 0(<S*). We 
would then expect for I = 1, . . . , K\ 

The above argument, of course, also depends on the regularity of the above inverses at /j,i(9*,9*) I = 
1, . . . , K , and in this regard our experiments in §ra provide encouraging results for an important model 
similar to the model in the following example: 

Example 4.1 Let f(x; 9*) — \(j){x — 9\) + h(j>(x — 9^), where is the density of the standard normal 
distribution. In this case any Voronoi partition is specified by a single parameter t = 0.5(#i + #2) solving 
4>{t—9i) = 4>{t—92) (ties are evidently inessential in this context). The true Voronoi partition corresponds 
to t* = 0.5{9l + 6*5). Given a Voronoi partition S{t(9)), Q(t) = {(t - a),(t + a) : a € K+}. Hence, 
restricted to 0(t), the function Hs(t)(9) — {p-i.s(t)(9), fJ-2,s(t)(9)) depends on one parameter only: Let a be 
this parameter and define fis( t ){9(a)) = (fii(a), /12(a)) as follows: fJ-i(a) = — a(l — 2$(— a)) — 2<f>(— a) +£, 
A*2(a) = 2t — /ii(a), where $ is the distribution function of the standard normal distribution. After 
calculating fix < fi2 from the data, the inversion equations of (Efl) become 

i-[a(l-2$(-a)) + 20(a)] =/*!, t + [a(l - 2$(-a)) + 20(a)] = /t 2 - (21) 

Obviously (Ell) has a (unique) solution if and only if fix, \±2 are symmetric with respect to t and the 



probability of this latter event is clearly zero under the model. Thus, as suggested in (20), we consider 
the equations separately: 

a(l -2$(-a)) + 20(a) = t - fn (22) 

o(l - 2$(-a)) + 20(a) = fa - t. (23) 

It can be shown that (|22| ) and ( |23|) have unique solutions, let us denote the latter by a\ and 02, respectively. 
The points t — ai and t + a2 will be now taken as the estimators of 6* and 9\ for the next step of iterations. 

VA2 

1.) Choose 8° = ($<{,..., 9° K ). 

11 



2.) Given 6°, find Sffi) and define empirical measures P™(0 J ) as m (Ft)- 
3.) For every P™, find (j,i(O j ,6*) as in @. 
4.) Update 6'- 7+1 in accordance with (p0[). 

4.1 Unknown weights 

We consider the case when the mixture weights pi are unknown, which corresponds to the case of the 
unknown transition parameters (P, n) in the general HMM context. 

The Voronoi partition depends on the weight-vector p = (pi, . . . ,Pk) as well as on 9. Hence, S(9,p) 
and the vector p should be reestimated at each step along with 9. Given a Voronoi partition S = 
{Si, . . . , Sk}, the simplest way to estimate the weights pi is to take pi — P n (S{), the empirical measure 
of Si. Hence all the algorithms considered so far can be modified accordingly to include the weight 



estimation as in (24). 



p?' +1 = P„ WV)), l = l,...,K. (24) 

Taking into account the asymptotics, it is easy to correct the estimators pj +1 as well. Indeed, suppose 
6 j = 9* and p> = p, i.e. S(d j ,p>) = S(9*,p) = S* . If n -> oo, then 

V ' V ' JSi(6',p) • JSi(0*,p) 

In general the latter differs from pi. The difference is pi — P(Si(0*,p)). Hence, by analogy with (Q), we 
can define the weight correction D(9,p) — (Di(9,p), . . . , Dk(9,p)) as follows: 

Di(8,p)=pi-J2Pi I fi(x;di)d\, (26) 

i J s,(e, P ) 

which is also data independent. We now summarize the above by giving a formal definition of the adjusted 
Viterbi training with the weight correction. The Viterbi training and the second type of adjusted Viterbi 
training with p unknown can be defined similarly. 

VAl with the weight correction 
1.) Choose 9° = (91, ...,9° K ) and p° = (p§, ...p° K ) 

2.) Given 9 j = (9{, ...,9 3 K ) and p> = [p{, . . .p J K ) define the Voronoi partition S(& ,p>) = {Si, ..., S K } 
as in ( |1^ ) and (|l3|), and the empirical measures Pptfi ,p>) as in (|l4|). 

3.) Put 6 j+1 = jjP{O j ) + A(6"), where jj? is defined in @. 

4.) Put pi +1 = P n (Si{9^p>)) + D(W,p>). 
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5 Simulation studies 

In order to support our theory of adjusted Viterbi Training we simulate 1000 i.i.d. random samples of 
size 1000 according to the following mixture: 

1 (x-Qi) 2 ,„ . (*-<? 2 ) 2 , 

(pe 2 +(l-p)e 2 ). 



The true parameters in our experiments are 9* = (—2.5,0) and (p, 1 — p) = (0.7,0.3). The corre- 
sponding density is plotted in Figure @. Note that for all such mixtures with p > 0.5 and 9\ < 02, 




Fig. 2. -i.(0.7e 



0.3e 2 ). The dashed vertical lines indicate the means of the individual 



components, the dotted line marks the mean of the mixture. 



62 — 8\ < ^/2p/(l — p) (= 2.1602 in our case) implies that the both means fall on one side of the decision 
boundary, which makes detection of the second component particularly difficult as is already becoming 
the case in our setting with 9% — 9\ =2.5. 

Our main goal is to compare the performances of VT, VA1, and EM in terms of the accuracy, con- 
vergence, amount of computations per iteration, and the total amount of computations. We implement 



these algorithms in Matlab ( The MathWorks, Inc. ) , providing a fair comparison of their computational 



intensities based on their execution times. Our code is available for the reader's perusal (Koloydenko and 
Lember 2003| ) and is fully-optimized for speed in the case of VT and EM. Consequently, our simulations 



possibly only overestimate the execution times for VA1. 

Additionally, we compare VA2 with the above algorithms by the accuracy and convergence. We use 
a numerical solver to compute the adjustment function of VA2 and presently make no effort to replace 
this by a computationally efficient approximation. Hence, we do not discuss the computational intensity 
of VA2 in this work. 

In our experiments, the algorithms are instructed to terminate as soon as the L2 distance between 
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consecutive 9 updates falls below 0.001. We also provide a high precision MLE computed with a built-in 

and 



matlab optimization function. The cases of known and unknown weights (§4.1) are considered in §5.1 



£5.2, respectively. We report the following statistics for each of the algorithms in the form: averageione 



standard deviation. 

• 6 = (6i, #2) - the estimates of the means; 

• p - the estimate of the weight of the first component; 

• \\9 — #*||i,2 - L\- and L2-normed distances between 9 and the true parameters; 

• n - number of steps used by the algorithm; 

• T - total time in milliseconds to execute the entire algorithm; 

• t - time in milliseconds to execute one iteration of the algorithm; 

5.1 Known weights 

It is often the case in speech recognition models that the weights are assumed known. 

First, consider (—1, 2) as an "arbitrary" initial guess for 9. Table [j] presents the performance statistics 
based on the 1000 samples. The baseline Viterbi method terminates quickly (on average in 9.04 steps), 
outperformed only by VA2, but is the least accurate among the considered methods. As expected, VT 
also requires fewest computations: 0.2 ms per iteration and 1.85 ms total. Ranked from low to high, 
accuracies of VA1, VA2, and EM appear similar, and are about three times superior to that of VT. In 
units of the VT execution time, EM compares to VA1 as 16.85:6.7 per iteration, and as 20.43:7.59 by 
the total execution times. In order to illustrate the asymptotic fixed point property, we initialize the 
algorithms to (—2.5, 0), the true value of the parameters, see Table g. In this case, as expected, the both 
types of adjustments exit noticeably faster than VT and EM, are comparable in accuracy to EM, and are 
about three times more accurate than VT. Unlike VAl, VA2 or EM, the baseline algorithm, as predicted, 
disturbs the correct initial guess, resulting in an appreciable bias. The times per iteration of VAl and EM 
are similar to as before, and their total times are (in units of the VT time): 5.71 and 16.03, respectively. 

In order to illustrate the idea of the second type of adjustment, we now initialize the algorithms to 
(—3.1229,0.8771), which produces the same decision boundary t = —0.9111 as 9* — (—2.5,0), the true 
values. Table ra collects these results. Note that since VT and VA2 depend on the initial guess only via 
the decision boundary, they produce in this case exactly the same results (disregarding a small rounding 
error) as in the case of the correct initial guess (Table ph. As expected, VA2 now terminates significantly 
faster than its competitors, and accuracy-wise is only slightly superior to VAl and slightly inferior to 
EM. The times per iteration of VAl and EM are similar to as before, and their total times are 7.84 and 
20.83, respectively. 
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Table 1. "Arbitrary" initial guess. 





VT 


VA1 


VA2 


EM 


MEL 


01 


-2.4869±0.0497 


-2.4952±0.0500 


-2.4959±0.0498 


-2.4970±0.0456 


-2.4973±0.0456 


02 


0.2880±0.0732 


0.0099±0.0917 


0.0082±0.0916 


0.0030±0.0757 


0.0024±0.0757 


\\e-8*\\i 


0.3291±0.0844 


0.1138±0.0681 


0.1133±0.0678 


0.0958±0.0562 


0.0958±0.0562 


||0 — 0*|[a 


0.2927±0.0727 


0.0902±0.0537 


0.0899±0.0536 


0.0761±0.0451 


0.0761±0.0451 


n 


9.04±1.55 


10.49±1.61 


7.84±1.59 


11.20±0.42 


N/A 


t 


0.20±0.05 


1.34±0.19 


39.24±1.56 


3.37±0.07 


N/A 


T 


1.85±0.55 


14.04±2.95 


308.57±68.63 


37.79±1.56 


N/A 



Table 2. Correct initial guess. 





VT 


VA1 


VA2 


EM 


MLE 


01 


-2.4904±0.0495 


-2.4973±0.0488 


-2.4973±0.0490 


-2.4973±0.0455 


-2.4973±0.0456 


02 


0.2820±0.0729 


0.0051±0.0880 


0.0052±0.0892 


0.0024±0.0753 


0.0024±0.0757 


110-0*11! 


0.3223±0.0829 


0.1087±0.0661 


0.1102±0.0664 


0.0953±0.0561 


0.0958±0.0562 


||0-0*||a 


0.2867±0.0721 


0.0861±0.0523 


0.0874±0.0525 


0.0756±0.0450 


0.0761±0.0451 


n 


5.56±1.72 


5.06±1.57 


4.73±1.55 


5.69±1.29 


N/A 


t 


0.22±0.02 


1.37±0.05 


42.23±0.94 


3.42±0.08 


N/A 


T 


1.21±0.31 


6.91±2.05 


199.52±65.67 


19.39±4.28 


N/A 



5.2 Unknown weights 



Assume now that the weights are unknown (§ |4.l[ ) and hence need to be estimated along with the means. 
We use the same data and the same three types of conditions as in the case of known weights: Arbitrary 
initialization to (— 1, 2) (Table Eh, initialization to the correct values (—2.5, 0) (Table ph, and initialization 
to (—3.1229,0.8771), an arbitrary point giving rise the correct intercomponent boundary (Table ph. VT 
and the adjusted algorithms VAl and VA2 in this case are implemented with the asymptotic correction 
(p6|). (The maximization in the high precision MLE is now performed in the three variables.) 

The adjusted algorithms now converge 1.7 (VAl) and 2 (VA2) times faster than EM, and, what is 
more remarkable, VAl and VA2 converge even faster than VT. The per iteration times of VAl and EM 
compare as about 1.8:4.8 for all the initializations, and the total times - as 1.8:8.7 (arbitrary guess), 
1.3:6.67 (true values), and 1.73:7.64 (true boundary), all in units of the VT time. VAl and VA2 are 
again at least three times more accurate than VT in 9 estimation and about one standard deviation more 
accurate than VT in the weight estimation. They are also comparable in accuracy to EM. 



15 



Tabic 3. Correct decision boundary. 





VT 


VA1 


VA2 


EM 


MLE 


01 


-2.4904±0.0495 


-2.4954±0.0497 


-2.4973±0.0490 


-2.4971±0.0456 


-2.4973±0.0456 


02 


0.2820±0.0729 


0.0094±0.0909 


0.0052±0.0892 


0.0030±0.0757 


0.0024±0.0757 


\\e-e*\\i 


0.3223±0.0829 


0.1131±0.0668 


0.1102±0.0664 


0.0958±0.0562 


0.0958±0.0562 


\\o-e*\\ 2 


0.2867±0.0721 


0.0897±0.0528 


0.0874±0.0525 


0.0761±0.0450 


0.0761±0.0451 


n 


5.56±1.72 


7.09±1.38 


4.72±1.56 


7.44±0.94 


N/A 


t 


0.22±0.03 


1.35±0.05 


42.37±1.12 


3.42±0.08 


N/A 


T 


1.22±0.31 


9.56±1.81 


200.24±66.30 


25.41±3.19 


N/A 



Tabic 4. Unknown weights. "Arbitrary" guess. 





VT 


VA1 


VA2 


EM 


MLE 


p 


0.747 ±0.031 


0.703 ±0.028 


0.702 ±0.028 


0.700 ±0.024 


0.699 ±0.024 


01 


-2.4299 ±0.0753 


-2.4919 ±0.0596 


-2.4930 ±0.0594 


-2.4976 ±0.0531 


-2.4992 ±0.0532 


02 


0.3944 ±0.1178 


0.0194 ±0.1099 


0.0173 ±0.1094 


0.0070 ±0.0944 


0.0039 ±0.0947 


\\e-e*\\i 


0.4775 ±0.1653 


0.1382 ±0.0851 


0.1372 ±0.0846 


0.1179 ±0.0708 


0.1179 ±0.0710 


\\0-e*\\2 


0.4058 ±0.1237 


0.1084 ±0.0657 


0.1076 ±0.0653 


0.0931 ±0.0558 


0.0931 ±0.0560 


n 


14.16 ±3.60 


13.85 ±3.25 


12.23 ±2.86 


24.90 ±2.60 


N/A 


t 


0.72 ±0.15 


1.32 ±0.09 


39.01 ±1.26 


3.52 ±0.35 


N/A 


T 


10.13 ±3.01 


18.25 ±4.27 


478.19 ±117.67 


87.67 ±12.98 


N/A 



5.3 Summary of the results 

VA1 is consistently close in accuracy to EM which is always superior to Vitcrbi Training: Specifically, in 
estimating the means, the gain in accuracy is about three-fold as measured by L\- and L 2 -distances, and 
in estimating the weights, it is about one standard deviation. 

VA1 always converges almost as fast as VT and noticeably (by 30% in the case of unknown weights) 
faster than EM. 

When the weights are known, an iteration of VAl is about six times longer than that of VT and is 
more than twice as fast as that of EM. By total execution, VAl is at most eight times slower than VT 
and is more than two and a half times faster than EM. 

When the weights are unknown, VAl is at most twice slower than VT and more than two and a half 
times faster than EM, per iteration. It is also about 50% slower than than VT and more than four times 
faster than EM in total times. 

Accuracy of VA2 is consistently between those of VAl and EM, and VA2 additionally converges faster 
than VAl. 



16 



Tabic 5. Unknown weights. Correct guess. 





VT 


VA1 


VA2 


EM 


MLE 


p 


0.737 ±0.030 


0.699 ±0.026 


0.699 ±0.026 


0.699 ±0.023 


0.699 ±0.024 


01 


-2.4526 ±0.0700 


-2.4987 ±0.0555 


-2.4987 ±0.0557 


-2.4991 ±0.0522 


-2.4992 ±0.0532 


e 2 


0.3537 ±0.1114 


0.0058 ±0.1007 


0.0060 ±0.1021 


0.0038 ±0.0925 


0.0039 ±0.0947 


\\9-0*\\i 


0.4212 ±0.1467 


0.1244 ±0.0782 


0.1263 ±0.0782 


0.1149 ±0.0701 


0.1179 ±0.0710 


||0-0*||2 


0.3626 ±0.1146 


0.0978 ±0.0607 


0.0994 ±0.0607 


0.0907 ±0.0553 


0.0931 ±0.0560 


n 


8.53 ±3.47 


6.01 ±2.44 


6.27 ±2.40 


11.89 ±4.24 


N/A 


t 


0.74 ±0.04 


1.36 ±0.05 


41.01 ±1.24 


3.53 ±0.06 


N/A 


T 


6.27 ±2.42 


8.13 ±3.19 


257.35 ±99.70 


41.84 ±14.81 


N/A 



Table 6. Unknown weights. Correct boundary. 





VT 


VA1 


VA2 


EM 


MLE 


p 


0.737 ±0.029 


0.702 ±0.026 


0.700 ±0.026 


0.700 ±0.023 


0.699 ±0.024 


01 


-2.4517 ±0.0689 


-2.4941 ±0.0573 


-2.4972 ±0.0556 


-2.4981 ±0.0526 


-2.4992 ±0.0532 


02 


0.3549 ±0.1096 


0.0148 ±0.1050 


0.0087 ±0.1024 


0.0059 ±0.0930 


0.0039 ±0.0947 


\\e-9*\h 


0.4218 ±0.1459 


0.1327 ±0.0779 


0.1271 ±0.0780 


0.1164 ±0.0694 


0.1179 ±0.0710 


II0-HI2 


0.3637 ±0.1132 


0.1043 ±0.0606 


0.0999 ±0.0606 


0.0919 ±0.0548 


0.0931 ±0.0560 


n 


8.16 ±3.40 


7.74 ±2.58 


6.54 ±2.22 


12.98 ±4.22 


N/A 


t 


0.74 ±0.04 


1.34 ±0.04 


41.12 ±1.74 


3.52 ±0.05 


N/A 


T 


5.97 ±2.36 


10.32 ±3.35 


268.96 ±91.44 


45.59 ±14.69 


N/A 



6 Conclusion 

We have considered the problem of parameter estimation of the emission distribution in Hidden Markov 
Models in connection with the two most common estimation algorithms: the EM algorithm and the 
Viterbi Training algorithm. We have identified the sources of bias, or lack of consistency, in VT estimation 
in comparison with EM (MLE) estimation. In the case of HMM, EM computes the MLE, which is 
often consistent. Trading the EM's accuracy for the VT's ease of computations, one loses, among other 
things, the asymptotic fixed point property: VT no longer holds the true parameter values fixed, even 
asymptotically. In this work, we have restored this property, and consequently recovered a certain 
amount of the EM's accuracy, without a significant increase in computations relative to Viterbi Training. 
Specifically, we have derived two types of adjustments to the baseline Viterbi Training algorithm. Our 
first algorithm, VAl, that we also present as the central contribution of this work, is a modification of 
VT that restores the asymptotic fixed point property. Wc also present evidence that, at least in the 
case of mixture models (a special and important case of HMM), the price in extra computations for this 
increase in accuracy can be made reasonable. The second algorithm, VA2, in addition to restoring the 
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fixed point property, also ensures that the true parameters are asymptotically found as soon as the true 
alignment (i.e. Voronoi partition) has been found. This latter feature may require intensive computations, 
undermining Vitcrbi Training as a computationally feasible alternative to EM. We intend to investigate 
feasible approximations to the correction function of VA2 in future work. 

Certainly, the final decision as to which algorithm to use is application dependent, and this work 
presents valuable information to facilitate such selection, especially in the context of the Gaussian mixture 
models. For this special case, we have provided simulation studies based on 1000 large random samples 
which illustrate the key features of the adjusted algorithms in contrast with EM and baseline Vitcrbi 
Training. In our simulations, VAl demonstrates a significant increase of accuracy (three-fold and one 
standard deviation in in estimating the mixture means and weights, respectively) relative to VT. In fact, 
the accuracy of VAl is already comparable to that of EM. Computation-wise, VAl in our studies is still 
several factors faster than EM. We therefore suggest replacing VT by VAl in applications that can afford 
computing (to a variable precision) the correction function in appreciation of the increased accuracy. 
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